1 Introduction

The search for a prior distribution $p(\omega)$ to be used as part of an objective Bayesian
analysis of a model $p(x \mid \omega)$ has proved to be a formidable endeavour. This is an area
where we do not have a definitive answer yet, and any contribution to the understanding 
of the subject must be welcome. The authors of this paper are among the most prominent 
contributors to this field, and reading the manuscript has been very stimulating. 

Research on the problem has mainly dealt with three issues: first, a definition of what 
a non-informative, reference or objective prior $p(\omega)$ must be; second, an operational
algorithm to calculate such priors; third, the evaluation of the resulting prior(s) in
accordance with certain criteria such as invariance, the avoidance of paradoxes, or desirable
frequentist properties. 

To us, and this is a subjective judgment, the most convincing approach to produce 
this sort of prior is reference analysis (Bernardo, 1979; Berger and Bernardo, 1992a,b;
Bernardo, 2005; Berger et al., 2009). This procedure: (i) defines the reference prior as
the prior maximizing the expected gain of information provided by a sample; (ii) includes
a general (although potentially involved) algorithm to calculate the prior; and
(iii) avoids a number of paradoxes. Moreover, it generalizes the Jeffreys prior and exhibits
its limitations. Among its most remarkable results, it shows that the form of the
reference prior $p(\omega)$ may depend on the function of the parameters $\theta = \theta(\omega)$ which is
considered by the researcher to be of main interest. 
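
For concreteness, criterion (i) can be written in terms of the expected information that a sample $x$ provides about $\omega$. The display below is only a sketch: the symbol $I\{p\}$ and the marginal $p(x)$ are introduced here for illustration, and the limiting argument over repeated samples that makes the maximization well defined is omitted.

```latex
I\{p\} \;=\; \int p(x) \int p(\omega \mid x)\,
  \log \frac{p(\omega \mid x)}{p(\omega)} \, d\omega \, dx,
\qquad
p(x) \;=\; \int p(x \mid \omega)\, p(\omega)\, d\omega .
```

Roughly speaking, the reference prior is the $p(\omega)$ that maximizes this quantity as the number of independent replications of the experiment grows.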

Since its inception, the algorithm to obtain reference priors has evolved. This is 
the case specifically in the multiparameter setting. The most recent version (Berger and 
Bernardo, 1992a,b) requires all scalar components of the parameter to be strictly ordered 
in terms of their inferential interest. Thus, in principle, the current approach does not 
offer any solution if the researcher is simultaneously interested in two or more scalar
parameters (or functions thereof). Interestingly, the original algorithm of Bernardo (1979)
did cover this situation, although the solution was the multivariate Jeffreys prior which 
leads to unsettling paradoxes in some cases. 

In this paper, the authors explore some ideas to extend the reference analysis to 
this yet unsolved case. They also seem to be considering a more general version of 
the problem by assuming that the number of scalar parameters (or functions of the 
parameters) of interest may be greater than the number of parameters in the model. 
So, the question is: what should the objective prior $\pi^R(\omega)$ ($\omega \in \mathbb{R}^k$) be if there are $m$
functions $(\theta_1(\omega), \theta_2(\omega), \ldots, \theta_m(\omega))$ which are of simultaneous interest, where $m$ is not
constrained to be less than or equal to $k$? Three methods to produce the required prior
distribution are discussed: (i) the common reference prior; (ii) the reference distance 
approach; and (iii) the hierarchical approach. 


2 Common reference prior 

This is not really a method. If the reference priors corresponding to $\theta_i(\omega)$ as the parameter
of interest ($i = 1, \ldots, m$) are the same for any ordering of the remaining parameters,
then the posed problem simply vanishes. It is interesting to see some examples illustrating
particular cases where the common prior exists, but it is desirable - and would
be much more useful - to have general results characterizing sampling models where, 
for example, Theorem 2.1 applies and hence a common reference prior may be found. 
In this regard, results such as those in Gutiérrez-Peña and Rueda (2003) and Consonni
et al. (2004) could provide a good starting point. These authors find reference priors for 
wide classes of exponential families that include the family discussed in Section 2.1.3 of 
the present paper as a particular case. 

It must be pointed out that this section relies on the analysis of the information
matrix $I(\theta)$, so all reviewed scenarios assume $m \le k$. Also, a somewhat disquieting result
is that of Section 2.2.2, where the authors show that
$\pi^R(\psi_1, \psi_2, \psi_3, \mu_1, \mu_2) \propto (\psi_1 \psi_2)^{-1}$ is
the one-at-a-time reference prior for any of these parameters and any possible ordering.
In particular, it is the reference prior for the case where $\mu_2$ is the parameter of main
interest. It so happens, however, that this prior is equivalent to the right-Haar prior
which leads to a problematic posterior precisely for $\mu_2$. This result would imply that,
in general, reference analysis might produce inadequate posteriors for the parameter of
interest, depending on the specific accompanying parameters.


3 Reference distance method 

In order to introduce this method, the authors explicitly assume that $\theta = \omega$, hence
$m = k$. The idea is to find an overall prior $\pi(\theta)$ such that each of its marginal posteriors
$\pi(\theta_i \mid x)$ is close to the corresponding marginal posterior $\pi_i(\theta_i \mid x)$ obtained when $\theta_i$ is the
parameter of interest ($i = 1, \ldots, m$). As a measure of approximation the authors propose
a weighted average of expected logarithmic divergences, although other measures could
in principle be used. Also, the search for the overall prior is restricted to a specific
parametric family $\mathcal{F} = \{\pi(\theta \mid a),\ a \in \mathcal{A}\}$. Apart from the fact (acknowledged by the
authors) that the existence of an optimal $a$ is not guaranteed, a rather unappealing
feature of this proposal is its dependence on the family $\mathcal{F}$. The authors offer no guidance
on how to choose $\mathcal{F}$ in general. If the aim is to produce an objective approach, it seems
desirable that $\mathcal{F}$ be somehow intrinsic to the sampling model. The examples in the
paper suggest that perhaps this could be achieved through some kind of conjugacy.
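
To fix ideas, the criterion described above might be written as follows. This is a hedged sketch rather than the authors' exact definition: the weights $w_i$, the divergence symbol $\kappa$, the direction in which the divergence is taken, and the predictive density $p_i(x)$ used for the expectation are all assumptions made here for illustration.

```latex
d(a) \;=\; \sum_{i=1}^{m} w_i \int
  \kappa\bigl\{\pi(\theta_i \mid x, a) \,\big|\, \pi_i(\theta_i \mid x)\bigr\}\; p_i(x)\, dx ,
\qquad
\kappa\{q \mid p\} \;=\; \int p(\theta_i)\, \log \frac{p(\theta_i)}{q(\theta_i)}\, d\theta_i ,
```

with $w_i \ge 0$ and $\sum_i w_i = 1$. The overall prior would then be $\pi(\theta \mid a^{*})$ with $a^{*} = \arg\min_{a \in \mathcal{A}} d(a)$, whenever such a minimizer exists.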

Incidentally, the reference distance method bears some resemblance to the mean-field
approach to variational inference, which is relatively straightforward in the case of
exponential families with conjugate priors; see, for example, Bishop (2006, Chapter 10).
What is the authors’ take on this?

We would now like to comment on Example 3.2.4. There, the normal model $N(x \mid \mu, \sigma)$
is considered, and the parameters of interest are $\mu$, $\sigma$ and $\phi = \mu/\sigma$. (Note that, despite
the authors’ remark at the beginning of Section 3, here $\theta \neq \omega$ and $m > k$.) In any
case, the authors remind us that the reference prior when $\mu$ or $\sigma$ is the parameter of
interest is $\pi(\mu, \sigma) = \sigma^{-1}$, whereas the reference prior for $\phi = \mu/\sigma$ is given by
$\pi_\phi(\mu, \sigma) = (2\sigma^2 + \mu^2)^{-1/2} \sigma^{-1}$. They then propose, as a “natural” choice, the class of relatively
invariant priors $\mathcal{F} = \{\pi(\mu, \sigma) = \sigma^{-a};\ a > 0\}$. For this family, they show that the overall
prior for $(\mu, \sigma, \phi)$ can be approximated by $\pi^o(\mu, \sigma) = \sigma^{-1}$, so that inclusion of $\phi$ as an
additional parameter of interest makes no difference. We find this rather disappointing.
From an algorithmic point of view, this outcome is not surprising given the choice
of $\mathcal{F}$ and the form of the reference priors for $\mu$ and $\sigma$. Only a large weight on the
divergence corresponding to $\phi$ could lead to a different result. An idea that springs to
mind is to try another (arguably more “natural”) family such as
$\mathcal{G} = \{\pi(\mu, \sigma \mid a_1, a_2) = (2\sigma^2 + \mu^2)^{-a_1} \sigma^{-a_2};\ a_1 > 0,\ a_2 > 0\}$,
which includes all three reference priors for $\mu$, $\sigma$
and $\phi$. On the other hand, since $\pi_\mu(\mu, \sigma)$ and $\pi_\sigma(\mu, \sigma)$ are equal in this case, the authors
could alternatively have minimized the sum of the two divergences corresponding to
the marginal posterior of $\phi$ and the joint posterior of $(\mu, \sigma)$. We wonder how these
alternative ideas compare with that proposed in the paper for this example.
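
The claim that the two-parameter family contains the three reference priors can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the densities are unnormalized, and the priors for $\mu$ and $\sigma$ are recovered at the boundary value $a_1 = 0$, which is treated here as admissible (an assumption of this sketch).

```python
import math

def log_prior_G(mu, sigma, a1, a2):
    """Log of the unnormalized density (2*sigma^2 + mu^2)^(-a1) * sigma^(-a2)."""
    return -a1 * math.log(2.0 * sigma**2 + mu**2) - a2 * math.log(sigma)

def log_ref_mu_sigma(mu, sigma):
    """Log reference prior when mu or sigma is of interest: 1 / sigma."""
    return -math.log(sigma)

def log_ref_phi(mu, sigma):
    """Log reference prior when phi = mu / sigma is of interest."""
    return -0.5 * math.log(2.0 * sigma**2 + mu**2) - math.log(sigma)

# A few arbitrary (mu, sigma) points at which to compare the log densities.
points = [(-1.5, 0.3), (0.0, 1.0), (2.0, 4.0)]

# (a1, a2) = (0, 1) gives the prior for mu and sigma (boundary case, assumed admissible).
matches_mu_sigma = all(
    math.isclose(log_prior_G(m, s, 0.0, 1.0), log_ref_mu_sigma(m, s), abs_tol=1e-12)
    for m, s in points
)

# (a1, a2) = (1/2, 1) gives the prior for phi.
matches_phi = all(
    math.isclose(log_prior_G(m, s, 0.5, 1.0), log_ref_phi(m, s), abs_tol=1e-12)
    for m, s in points
)
```

Under these assumptions, the reference distance criterion would be minimized over the pair $(a_1, a_2)$ rather than over the single exponent $a$.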


4 Hierarchical approach 

The idea of this approach is, first, to find a “natural” parametric family of proper priors 
$\pi(\theta \mid a)$ such that $a \in \mathbb{R}$ and the integrated likelihood results in a proper density $p(x \mid a)$.
Then, the univariate reference prior for $a$, $\pi^R(a)$, is obtained for this latter model.
Finally, the overall prior $\pi^o(\theta)$ is defined as the expectation of $\pi(\theta \mid a)$ with respect
to $\pi^R(a)$. This is an intuitive and seemingly reasonable idea. However, it is not clear
how to make explicit that $\theta$ is the parameter of interest even though the model is
originally indexed by $\omega$, especially when the dimension of $\theta$ is larger than that of $\omega$.
(See the comment below concerning the multi-normal means example.) We wonder if 
the authors can provide some advice on how this could be achieved in general. On 
the other hand, as in the reference distance case, dependence upon a specific family of 
priors introduces no small amount of arbitrariness in the method. Here, again, a proper 
objective method would use an intrinsic family entirely determined by the sampling 
model. One possibility, particularly suitable for the case of hierarchical models, would 
be to elaborate on the idea of conjugate likelihood distributions (George et al., 1993),
although a suitable restriction should be imposed on the corresponding conjugate family 
in order to get a one-dimensional hyperparameter. Concerning the implementation of 
the method, the authors suggest that integration to get the overall prior can be avoided 
by using $\pi^o(\theta) = \pi(\theta \mid \hat{a})$ instead, where $\hat{a}$ is the mode of the posterior $p(a \mid x)$. This
proposal may be efficient from a computational point of view, but it is both surprising 
and disappointing since it essentially reduces the hierarchical approach to a standard 
empirical Bayes procedure and leads to a data-dependent prior. 
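
In symbols, the construction and its computational shortcut can be summarized as below. This is a sketch in the notation of this section; the hat on $a$ for the posterior mode is a convention adopted here.

```latex
p(x \mid a) \;=\; \int p(x \mid \theta)\, \pi(\theta \mid a)\, d\theta ,
\qquad
\pi^{o}(\theta) \;=\; \int \pi(\theta \mid a)\, \pi^{R}(a)\, da ,
\qquad
\pi^{o}(\theta) \;\approx\; \pi(\theta \mid \hat{a}), \quad
\hat{a} \;=\; \arg\max_{a}\, p(a \mid x) .
```

The last expression depends on the data $x$ through $\hat{a}$, which is why the shortcut yields a data-dependent prior.
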

The example in Section 4.2 concerning the multivariate hypergeometric model is
confusing and does not quite illustrate the method described above. First, the parameters
of the sampling model are given a multinomial prior (which does not depend on
a single scalar parameter $a$, but on a vector of probabilities $p_k$); then, the likelihood is
integrated and shown to yield a multinomial distribution. In the process, the $k$ original
parameters $R_1, R_2, \ldots, R_k$ are replaced by the parameters $p_1, p_2, \ldots, p_k$, so the idea of
reducing the problem to the determination of the reference prior for a scalar parameter
is abandoned. Next, in the multinomial model, the approximate overall prior obtained
using the reference distance method is adopted for the hyperparameters $p_k$. Finally, the
corresponding integrated Multinomial-Dirichlet distribution is declared as the overall
prior for $R_1, R_2, \ldots, R_k$. We find this ad hoc combination of methods difficult to justify
as a general procedure.

An alternative formulation could be based on the idea of super-populations (quite
common in the field of survey sampling) as follows. Let us assume that a random sample
of size $N$ is obtained from a multinomial distribution $\mathrm{Mu}_k(Y_k \mid 1, p_k)$. As a result
we get a vector $R_1, R_2, \ldots, R_k$ describing the number of sampled units in each category.
Now imagine that we then get a subsample of size $n$, without replacement, from
the sample of size $N$. In this setting, the multinomial distribution describes an infinite
super-population, the sample of size $N$ is the finite population of interest and the subsample
of size $n$ is the actual sample we observe. Given the sample, the likelihood based
on the subsample corresponds to that of a hypergeometric distribution. However, with
respect to the super-population, the subsample is just a sample of the original multinomial
population whose parameters are given by the vector $p_k$. Within this framework,
$R_1, R_2, \ldots, R_k$ are observables and any inference regarding these quantities must be
produced through the corresponding posterior predictive distribution. This argument
shows that the hypergeometric problem can be viewed as a multinomial one where the
interest is not really on the parameters but on observables, and the relevant overall
prior is that for $p_k$, no matter which method we use.
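
The key step of this argument, that the subsample is itself a sample from the super-population, can be verified exactly in the two-category case. The sketch below is illustrative only: the function names and the values of `N`, `n` and `p` are arbitrary choices, and the binomial stands in for the multinomial with $k = 2$.

```python
from math import comb

def binom_pmf(r, n, p):
    """Binomial probability of r successes in n independent trials."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)

def hypergeom_pmf(r, N, R, n):
    """Probability of r successes in a subsample of size n drawn without
    replacement from a finite population of N units containing R successes."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

def subsample_pmf(r, N, n, p):
    """Marginal distribution of the subsample count: the hypergeometric
    likelihood averaged over the binomial law of the population count R."""
    return sum(
        binom_pmf(R, N, p) * hypergeom_pmf(r, N, R, n) for R in range(N + 1)
    )

# Arbitrary illustrative values: population size, subsample size, success probability.
N, n, p = 12, 5, 0.3

# Largest discrepancy between the marginal subsample law and Binomial(n, p).
max_gap = max(
    abs(subsample_pmf(r, N, n, p) - binom_pmf(r, n, p)) for r in range(n + 1)
)
```

The discrepancy `max_gap` is zero up to floating-point rounding, so with respect to the super-population the observed counts follow the same family as the original sample, only with index $n$ instead of $N$.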

The example on the multi-normal means (Section 4.3) deserves a few words as well.
Here, the parameters of interest are, using the same notation as the authors, $\mu_i$,
$i = 1, \ldots, m$ and $|\mu|^2 = \mu_1^2 + \cdots + \mu_m^2$. First, we note that throughout the paper $k$ refers
to the dimension of $\omega$ and $m$ is the number of parameters of interest (the dimension
of $\theta$), so in this example we have $m = k + 1$. It must be pointed out, however, that the
hierarchical method, as defined, cannot be applied when $m > k$ since the distribution
$\pi(\theta \mid a)$ would then be defined over a space of functionally related components of $\theta$ and
would be singular. This fact is implicitly recognized by the authors when they propose
a prior for $(\mu_1, \ldots, \mu_m)$ only, ignoring the last parameter of interest, $|\mu|^2$. They then
argue that the resulting overall prior is reasonable not only for each mean $\mu_i$ but also
for $|\mu|^2$. The key issue here is the convenient choice of $\pi(\mu \mid a)$ as the product of the
normals $N(\mu_i \mid 0, a)$. So, strictly speaking, this problem is not actually solved by using
the hierarchical approach as proposed in the paper but by an ad hoc choice of $\pi(\mu \mid a)$.


5 Final remarks

This paper contains many interesting ideas and examples. However, it offers more of a 
brainstorming than a systematic treatment and a general solution to the problem. It is 
somewhat disappointing that the methods proposed in the paper bear little resemblance
to the original reference prior approach, where the problem is clearly stated, the
criterion used is sensible, and one can typically obtain unique and reasonable solutions. 
The approaches proposed here are still far from becoming operational algorithms since 
they require a number of arbitrary inputs. Hopefully, at least one of these methods will 
evolve into an overall objective procedure to find overall objective priors. We believe the 
reference distance method to be the most promising in this regard. 